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Abstract. The use of the Principal Component Analysis technique for the analysis 
of geophysical time series has been questioned in particular for its tendency to extract 
components that mix several physical phenomena even when the signal is just their 
linear sum. We demonstrate with a data simulation experiment that the Independent 
Component Analysis, a recently developed technique, is able to solve this problem. 
This new technique requires the statistical independence of components, a stronger 
constraint, that uses higher-order statistics, instead of the classical decorrelation, a 
weaker constraint, that uses only second-order statistics. Furthermore, ICA does not 
require additional a priori information such as the localization corstraint used in 
Rotational Techniques. 
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1. Introduction 

This work concerns methods for the identification of the physical causes of 
variability of a given dynamical system from observations of its behavior. In many 
cases, an observed time series is produced by a mixture, linear or nonlinear, of different 
components representing different physical phenomena. In the linear case, the time 
series x(j j with temporal dimension, N, at particular spatial coordinate (we will call 
this a pixel), j € {1, . . . , M}, where M is the spatial dimension, is decomposed ,n time 
as: 

x{j) = G a = g x oi{j) + g 2 o 2 (j) + ■ ■ ■ + 9q^qU), (1) 

where the temporal base functions, g 1} . . . ,gq, the columns of matrix G, are unknown 
time series describing a fixed dynamical behavior (vectors and matrices are indicated in 
bold characters). In this paper, we consider decomposition in time, but the following 
discussion would be the same for a decomposition in space. Each g { could be a signal 
with a different physical cause operative in a particular geographical region represented 
by a different score map {^(j) ; j = 1, . . . M }. One goal of an analysis is then to infer 
the unknown contributing components from the observed data, x. In the linear case 

h = J • x ~ <r, (2) 

where J is an estimate of the unknown matrix G _l and h is an estimate of the 
unknown vector <x. Statistical analysis methods that estimate J and h are called 
component extraction techniques. Their ability to retrieve good estimates, h, of the true 
components, <r, is highly dependent on the quality of the statistical dataset used (i.e. 
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sufficiently large number of independent examples sampling all the variations involved) 
and on the technical assumptions that are made about J and h. 

A common approach is to require the decorrelation of the extracted components (i.e., 
any two components are orthogonal): the covariance matrix of extracted components 

< h • h > is constrained to be diagonal; but this decorrelation principle has an infinity 
of solutions: 

J — & ■ Jq, (3) 

where © is an undetermined Q x Q matrix so that ©‘ • © = I QxQ . J Q = E~ 1/2 • E l 
is a Q x N matrix with E the truncated diagonal matrix of the higher eigenvalues of 

< x l x > and E the N x Q matrix with the associated normalized eigenvectors in the 
columns. 

One particular decorrelation solution is the well-known Principal Component 
Analysis (PCA) or, in the geophysical community, Empirical Orthogonal Function 
(EOF), first used in atmospheric sciences by Lorenz (1951). In this technique, an 
additional constraint is added to resolve the indeterminacy of the decorrelation solutions: 
the successive extracted components have to explain the maximum remaining variance. 
This solution is given by taking © = I QxQ in Equation (3). Three problems could 
arise in using the PCA teechnique. (1) Even if the mixing of the components is linear 
as in Equation (1), this maximum-explained-variance assumption can cause mixing 
of different physical phenomena in the extracted components (Kim and Wu, 1999) as 
we will show here (see Figure 1-A for an schematic illustration of this problem in a 
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2-dimentional case). (2) This mixing problem is also particularly serious when the 
PCA is applied to data that have more than one component with about the same 
variance. In this case, the problem is not solvable since any orthogonal rotation of the 
principal components will be a PCA solution (Figure 1-B). (3) Since PCA imposes the 
orthogonality of the base functions it extracts, mixing problems also arise when the 
actual physical base functions are not orthogonal (Figure 1-C). 

The PCA assumptions (linearity, variances explained by the components or 
orthogonality) used to resolve the solution indeterminacy are not known, a priori, to be 
valid for a particular dataset. It these PCA assumptions are not valid, variations that 
are not physically connected could be artificially gathered together into one extracted 
component. This is the reason why PCA is often used in restricted geographical 
domains instead of global domains to try to isolate a single dominant mode of variation, 
wich PCA can correctly identify. Consequently, even if PCA is useful as a tool for 
compressing information by describing the most variance with the fewest terms in an 
expansion, it can lead to misinterpretation of physical relationships. 

Rotational Techniques (RT) were introduced (Horel 1981, Richman 1981), in part, 
to obtain a solution more physically interpretable. In these approaches, an additional 
constraint of localization, based on the so called “simple structure” principle, is used 
to solve the indeterminacy of the decorrelation solutions. There exist many proposed 
criteria for this purpose: quartimax, varimax, transvarimax, quartimin, oblimax, etc 
(Richman 1986). Two distinct classes of RT solution could be distinguished: the 
Orthogonal Rotations that preserve the orthogonality constraint of the components, 
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and the Oblique Rotations that relaxe this constraint. However, the localization criteria 
used in both family of solutions are quite subjective and use of such a priori information 
may not be well suited to all applications. 

The Independent Component Analysis (ICA) method, briefly described in Section 
2, is based on information theory and has been recently developed in the context of 
signal processing studies and of the development of neural coding models (Bell and 
Sejnowski 1995). The two major distinctions between the ICA approach and the 
classical techniques are: 


• The assumption of a linear mixture model is not required so an orthogonality 
constraint is not applied. However, the example presented here happens to be one 
where the signal is composed of the linear sum of components as in (1). 

• The method extracts statistically independent components, even if these 
components have a non-Gaussian probability distribution function, by making 
use of higher order statistics, whereas the PCA or RT approaches use only 
second-order statistics. 

We argue that the ICA approach is a particularly promising technique which may 
overcome the main pitfalls of the standard techniques (PCA or RT) for geophysical time 
series analysis. When faced with observations of a system with unknown dynamics, 
identifying the statistically independent variation mode seems more likely to be 
meaningful than assuming, a priori, that the system is composed of linearly mixed, 
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orthogonal modes unless such modes ca be shown to be present form other information 
(North 1984). Moreover, if one or two modes are not known, a priori, to be dominant, 
then maximizing the variance explained by each mode as in PCA produces inappropriate 
mixing. Using other subjective criteria also seems difficult to justify. Hence, the classical 
techniques can be used in situations where additional information is available to justify 
their assumptions as useful approximations, but they cannot be used to search for new 
understanding of the system dynamics because they make such strong assumptions 
about it. ICA, by finding statistically independent modes, may provide a better way 
to explore the dynamics of a system, like the atmosphere-ocean system, that is known 
to involve non-linear coupling of many modes across a wide range of space-time scales; 
however, even statistical independence is only a guide to the system’s behavior. 

A previous study that applies ICA to the analysis of variations of tropical sea 
surface temperature (Aires et al. 2000) illustrates the potential of the ICA technique 
to separate a geophysical time series into more meaningful components. To illustrate 
more clearly how ICA handles the component mixing problem, we construct a synthetic 
dataset, where the true answer to the decomposition problem is known, and apply both 
PCA and ICA to extract components (Section 3). We deliberately devise a dataset to 
test whatever PCA can separate distinct modes of variation that are added linearly as 
many practitioners appear to expect. We show that, even in the case of a linear sum of 
components, the PCA technique mixes the contributions, but that the ICA method can 
correctly separate the components without additional subjective constraints like those 


used in RT. 
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2. The Independent Component Analysis technique 

In this section, we briefly review the main concepts underlying the Independent 
Component Analysis (ICA) technique. For more details, the interested reader is referred 
to Bell et al. (1995) and Aires et al. (2000). The ICA technique aims to extract 
statistically independent components, a stronger constraint than the decci relation 
requirement of the classical techniques. The statistical independence of uvo variables h\ 
and h 2 is determined when their joint distribution can be factored: 

P(hi,h 2 ) = P(h l )-P(h 2 ). (4) 

This constraint involves higher-order statistics whereas the decorrelation constraint only 
involves second-order statistics. Decorrelation is equivalent to statistical independence 
only in the Gaussian case. So the higher-order statistics are particularly important 
when the analyzed data have components with non-Gaussian distributions. Avoiding 
the a priori assumption that second-order statistics are sufficient is important when 
the components are unknown as is usually the case. It is also important to not confuse 
the non-Gaussian character of the components with the non-Gaussian character of 
the data itself; however, if the data have a non-Gaussian distribution, then at least 
one component is also non-Gaussian, since for the simplest linear mixture of Gaussian 
components, the distribution would be Gaussian (a non-linear combination of Gaussian 
distributions could be non-Gaussian). Some previous studies examine this non-Gaussian 
behavior in the data (Burgers and Stephenson 1999, Aires et al. 2000). Without a 
priori information on this matter, the use of ICA is recommended since its requirement 
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of statistical independence is more general than the decorrelation assumption. 

The time series observations are gathered into a dataset X / of M observations 
x(j) = (x/ ; t = 1, . . . , N) with j 6 {1, . . . , M}, where M is the spatial dimension 
of the time series and N is its temporal dimension. The times series x(j) is assumed 
to be a mixture, linear or nonlinear, of several statistically independent components 

a = {at; i = l,...,Q}: 

x{j) = A(cr(j)) (5) 

wh<re A is an unknown mixture function, which is, by hypothesis, non-singular (i.e. it 
can be inverted). 

The goal of ICA is to retrieve a function $ : x — >■ h, where h is an estimate of a 
and the terms {hi ; i = 1, . . . , Q} are statistically independent. The estimate, h, is 
defined as a deterministic function (linear or not) of the observations: 

h t = $i(W u x) ; i = (6) 

where {Wi ; i = 1, . . . , Q} is the set of parameters of <I>. The number of components, 
Q, is here supposed to be known (this number can be estimated by a break in the 
frequency spectrum of the data, for example). With real observations, Q depends 
on the analysis objectives: extracting a lot of components allows for more complete 
description of the variability but makes the interpretation much more complicated, 
whereas extracting fewer components focuses attention on fewer different phenomena at 
the cost of explaining less of the variability. The interested reader should refer to an 
article by Nadal et al. (1999). 
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The parameters, W x , are estimated by applying a gradient descent algorithm to 
a cost function that specifies the statistical independence of the {h t ; i = 1, . . .,Q}. 
Different equivalent cost functions could be used; we focus here on the infomax approach 
to ICA (Nadal and Parga 1994) from which simple algorithms have been derived (Bell 
and Sejnowski 1995). Information theory is used to specify the statistical independence 
cost function: the fundamental quantity used here is information redundancy. Given Q 
variables, hi, h 2 , . . . , /iq, the information redundancy 7l(hi,h 2 , . . . , /iq) is defined as the 
Kullback diverge nee between the joint distribution P(h u h 2 ,..., h Q ) and the factorized 
distribution P(h ,) • P(h 2 ) ■ ■ . P(/iq): 


R(h> = f “ i°g 

J -°° <= 1 nr=i Pi{hi 


) 


(7) 


This information redundancy comes from information theory and measures the 
difference between the joint and the factorized distribution: when the redundancy 
TZ(h) = 0, Ph{h) = riHi Pi{hi ), which means, by the definition in equation (4), that the 
components of vector h are statistically independent. 

The use of a gradient descent algorithm to minimize this cost function is interesting 
since it allows for the introduction of any 0 priori information about the solution that 
may be available. For such a purpose, a second term in the quality criterion is added 
that represents any additional constraint(s) on the solution. For example, this additional 
information could be a constraint on the shape of the solutions, on the distance of 
the solution from a first guess, or on the regularity properties of the solution. Such a 
regularization approach, also used in variational assimilation methods for example, is a 
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classical way of using all the information that is available. 

A nonlinear regression model for the extraction model in Equation (6) has to be 
specified. The Multi-Layer Perceptron (MLP) is often chosen: this artificial Neural 
Network model is preferred for its nonlinear behavior. In this experiment, we use a 
simple MLP architecture with no hidden layer. This neural mapping is defined by (from 
right to left in Figure 2): 

V = f(h) = f(Jx), ( 8 ) 

where /, the logistic function, is only used for algorithmic considerations. The extracted 
components are not the outp it y or the neural mapping (8), but the vector h = J ■ x. 
We use this model because the mixture model is linear (the nonlinear mixture case will 
be the subject of future work). 

With the redundancy reduction criterion and no-hidden-layer architecture, an 
algorithmic implementation of the ICA has been found (Bell and Sejnowski 1995): 


A J tk oc J ik + yt ■ 

i 


(9) 


where 


d dy j 
dyi dh { 


_ d , 
dhi W 


) = 1 - 2 y,. 


This algorithm is described in a more practical way in the appendix. 1 


( 10 ) 


^ee also the web page http://www.cnl.salk.edu/ tewon/ica.cnl.html of the Compu- 
tational Neuroscience Laborabory of Terry Sejnowski at The Salk Institute for links to 
recent literature, software and demos concerning the ICA paradigm. 
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3. Application to a linear sum of components 

a. Construction of the synthetic dataset 

The synthetic dataset used in this study is generated to mimic the results of that 
are often expected from a PCA decomposition. For this purpose, Q = 6 components 
representing six different dynamical phenomena are used. Each component is described 
bj a different temporal base function g ^ (solid lines in Figure 3) constructed from 
composites of sinusoids with different frequencies and phases. These base functions have 
been normalized to give a standard deviation of unity; i.e. each component accounts 
for a similar amount of temporal variance. The temporal dimension of these base 
functions is taken to be N = 365 (e.g., one year of daily data). A spatial resolution 
of 2.5° x 2.5° is chosen, corresponding to M = 144 x 72 = 10368 pixels. Finally, the 
dataset, Xj = {x(j) 6 R N ; j = 1, . . . , Af}, where R N is the space of real vectors of 
dimension N, N = 365 and M = 10368, is formed from the time series x(j) for each 
pixel j by the linear sum of the base functions, x(j) = g^a i(J + . . . + g Q a Q (j) + e 
(linear model of Eq. 1). The term € is a Gaussian-distributed noise (zero mean and 
standard-deviation of 0.5), representing very noisy data. 

The {cri(j) ; i = 1, . ..,<?} indicate the strength of each component, z, at each 
pixel, j. These strengths are constructed to have a geographical Gaussian distribution, 
giving a different ellipsoidal distribution for each component (left column in Figure 4). 
Land contours are shown for easier description of the modes. One of the components 
has two peaks in its spatial distribution to represent a teleconnection pattern (map of 
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component 1 in left column of Figure 4), so the total number of ellipsoidal peaks is 
seven. Also, the geographical extent of two of the components overlaps in the Indian 
Ocean (maps of components 4 and 6 in left column of Figure 4) to complicate the 
component extraction process. 

The variance explained by the <5 = 6 components and the variance from the added 
noise are shown in Table 1. The i ariance explained by the components represents 67 % 
of the total variance and the noise represents 33 %. The variance of a component results 
from the combination of the temporal variability of t ie base function as a function of 
amplitude (that has been normalized here) and frequency, and the spatial extent of the 
component. 

b. Results of PCA and IC A 

The PCA components are determined by computing the matrix J 0 . The best 
number of PCA components to extract is determined by observing the spectrum of 
cumulative percent of variance explained by the PCA components (Figure 5). The first 
<5 = 6 PCA components represent 67.7 % of the total variance and the 359 remaining 
components that explain 32 3 % of the total variance represent the noise in the dataset. 
The temporal PCA base functions (crossed lines in Figure 3) are each compared with 
the real base function to which it best corresponds. PCA base functions 2, 3 and 4 
provide a relatively good estimate of the true functions, although there are some errors 
near the peak values. PCA temporal base functions 1, 5 and 6 (low frequencies) are 
much worse fits. In particular, higher frequencies have been mixed in with the real 
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base functions. The corresponding PCA score maps are defined at pixel j by the values 
(crj, . . - (Tq)(j) = J o • X-j* and are shown in Figure 4 (middle column), where X * is 
the j th column of data matrix X . We see that the PCA (or EOF) technique confuses 
elements from the different components, the general mixing problem, such that all 
of its components exhibit many more geographic peaks than in the real components. 
Even if the corresponding PCA temporal base function is relatively well retrieved, the 
corresponding score map still exhibits the mixing problem (see PCA base function No 2 
in Figure 3 and the corresponding PCA score map in Figure 4). One cause of the mixing 
is well illustrated in Table 1 where the variance explained by each. PCA components 
is compared to the variance of the actual components. The first PCA component 
explains 24.4 % of the total variance, which is much more than the true variance of 
13.3 %. The variance maximization constraint on the solution in PCA means that the 
first component is the mixture of many true component variabilities, whereas the 6 th 
PCA component represents only 3.1 % which is a considerable under-estimate of the 
real value of 10 %. The noise level estimate of 32.3 % is a good estimate, but its small 
under-estimate of the real noise is due to the projection of noise into the first 6 PCA 
components (representing 0.7 %). 

This mixing tendency of the PCA could suggest many more teleconnections in 
observations than are actually present. Since all six components contribute the same 
amount of variance, the PCA technique has combined many of the actually-separate 
components into several of its components, trying to maximize the amount of variance 
explained by each. However, the method is then compelled to alternate positive and 
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negative values to compensate for having too much variance when the components are 
added back together. This effect is especially apparent for the overlapping components 
in the Indian Ocean: two PCA base functions possess broad central peaks spanning 
the geographic distribution of both of the real components and two others possess, in 
this same location, two opposite-signed peaks (see PCA score maps of components 1,4, 
5 and 6 ; n Figure 4, middle column). A similar projection of real components into 
more than one PCA component occurs when a geographically isolated mode moves 
during the time period (Kim and Wu 1999). Moreover, the component with two peaks 
in the Americas, representing a real teleconnection, shows up in four of the 3 CA 
components (components 1, 3, 4 and 6 in Figure 4, middle column), but mixed with 
other components as well, suggesting teleconnections between the Americas and the 
South Atlantic and Indian Oceans that do not exist. 

For the ICA, a “whitening” procedure is applied first using the PCA solution: the 
observed data, x(j), are projected onto the first Q = 6 PCA components using the 
matrix J 0 . The ICA technique is then applied to the “whitened” data, x(j ) = J 0 • x(j ) 
(dimension Q = 6 instead of N — 365). This is equivalent to performing an oblique 
rotation on the PCA initial solution, see (Nadal et al. 1999, Aires et al. 2000). Thus 
the 6 ICA extracted components explain the same amount of variance as the 6 PCA 
components (67.7 %). 

The six ICA base functions are shown in Figure 3 (dotted lines). The ICA base 
functions are very similar to the real base functions; this comparison shows how the 
ICA technique has corrected its first guess (the PCA solution) to be closer to the true 
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temporal base functions. The additional information conveyed by the requirement of 
statistical independence is nicely illustrated: the ICA solution is better than the PCA 
solution for all six components. The ICA score maps are presented in Figure 4 (right 
column). The presence of large- amplitude noise amplifies the ambiguity, producing 
weak mixing of the components (see the score map for component 5, for example); but 
generally, the components are well retrieved and separated, even the teleconnection 
mode (ICA component 1 in Figure 4, right column) and the two overlapping modes in 
the Indian Ocean (ICA component 4 and 6 in Figure 5, right column). The ICA score 
maps are always an improvement over the PCA solution. When the noise is removed, 
the ICA separates the original six modes very cleanly with very little mixing; thus, in 
practice, if the noise amplitude is significantly smaller than the signal amplitude, the 
ICA solution is very close to the real solution. 

Table 1 shows that the variance explained by the ICA components is closer to 
the real solution than the PCA components. Differences between the true and ICA 
explained variance for each components are less than 0.6 %. Discrepancies are the result 
of the projection of some part of the noise into the ICA components. 

4. Concluding remarks 

PCA (or EOF analysis) provides an economical way to summarize or characterize 
the complex time and regional variations of a geophysical parameter over the whole 
globe with only a few functions. When the total time variance is dominated by one 
or two separate orthogonal modes that add approximately linearly and with different 
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variances, then this analysis can also correctly identify such modes. However, PCA is 
now routinely applied to the study of atmospheric observation anomalies, which are 
defined as deviations from the average leading mode, the annual cycle. There are three 
Problems. (1) The annual cycle, itself, cannot always be described by a single EOF 
mode (e.g., Rossow et al. 1993), indicating that its phase and/or regional distribution 
are varying over the time record "o that the mean annual cycle is only an approximation 
to the actual annual cycle in a given year. (2) The several anomalies that have been 
identified by removal of the mean annual cycle generally only account for a few percent 
of the total variance 1 and are not a lot larger than data noise. (3) The fluid dynamics 
of the atmospheric circulation strongly couples motions on different space and time 
scales, i.e., the separate modes combine non-linearly to produce the total time signal. 
The latter fact means that the annual cycle is constantly interacting non-linearly with 
atmospheric variations on a whole range of other time scales, so that the removal of 
the mean annual cycle from a dataset leaves behind a residual that is only a partial 
representation of these physical interactions, producing an artificial mode from the 
analysis method. Thus, there is every reason to suspect that the assumptions that 
underlie the PCA are violated when applied to atmospheric circulation anomalies. This 
leads to the worry that the anomalies that have been identified and studied are not 
physical, but are either created by the mixing of many separate physical modes having 
similar amplitudes, but different regional distributions and interacting non-linearly, or 
by the mixing (aliasing) of the annual cycle into other modes by observation errors. Our 
simple example shows how extra modes can be generated by regional overlaps between 


two different modes and how spurious teleconnections can be found where none exist 
or distorted where they do exist, even when the modes add linearly, (i.e., favorable 
condition for the PCA). 

Our simple linear example also 'shows the potential of the ICA technique for 
separating a complex signal in a more meaningful way. The mixing problem inherent 
in the PCA technique and the artifacts prod' iced by the orthogonality and maximum 
variance constraints of PCA are avoided with the ICA approach. Moreover, the use of 
higher-order statistics to determine statistical independence assumes much less about 
the character of the parameter distributions than the PCA. In the case where the 
components of the system are linked in a certain way (the climate is certainly closer to 
this model), the ICA would be able to extract components that are a kind of prototype, 
defined to be as statistically distinct as possible, for an optimal description of the 
variability in the observations. This means that ICA, solving the mixing problem, is 
more suitable for global studies, which is not the case of PCA. 

As with the classical PCA technique, this first ICA algorithm is not able to deal 
correctly with propagating components. But the ICA paradigm may be a sufficiently 
general concept to-be used in a more sophisticated way like complex ICA or nonlinear 
ICA. Our experiment on synthetic data, where the solution of the component extraction 
problem is known, encourages further work to extend the ICA paradigm to non-linear 
cases: this requires development of non-linear solution algorithms and testing for cases 
where the combination of modes is non-linear, when components are physically linked, 
and for cases with propagating modes. 
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Appendix: Principal steps of the algorithm 

We adopt here the linear model x = G ■ a, where x is the observation, G is the 
base function matrix and <r is the vector of components to estimate. The goal of the 
statistical decomposition technique is to estimate a matrix J = G _I , the filter matrix, 
using only a dataset of observations {x e ; e = 1, . . . ,E}, where E is the number of 
samples in the dataset. With this matrix J, for each observation x, the components a 
are estimated by h = J • x, and the base function matrix G is estimated by the inverse 
matrix J . 

The principal steps of the time series analysis by the ICA technique are: 

• Pre-processing of dataset: The dataset X / = {x(j) € R N ; j = 1, • ■ ■ , M}, where 
t is the time index and j is the space index (geographical locations), often requires a 
pre-processing step: 

- spatial, temporal or spatio-temporal interpolation to resolve missing data 
problems, 

- filtering of data to suppress undesirable frequencies, 

- de-trending to obtain a stationary data, 

- removing the annual cycle to examine interannual anomalies. 

None of these steps is required. For example, we have commented in the paper on the 
dangers of removing the annual cycle. 

• Chose the space for the decomposition: 
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in time, which is the approach we have adopted in our study: x{j) — 

9\0\{j) + • • • + 9q°q (j) + e, 

- in space, 

- in frequency, 

- in a mixture of these spaces. 

The observations (a time series or a geographical field, . . . ) are noted, in the following, 
by the d-dimensional vector x e and the dataset is {x e ; e = 1, E }. 

• Center the dataset: The observation mean < x e > is removed from the dataset: 

x * x<! ~ < xe > This step is necessary for statistical techniques where data are 
supposed to have zero-mean. 

• Normalize the dataset: If the user wants to put the same statistical weight on each 
coordinate of the observation x e (that could be a date in time decomposition, a pixel 
location in space decomposition, . . . ): observations in the dataset are normalized by the 
standard-deviation vector x e <— x e fe x . 

• Eigen-vector decomposition: The covariance (or correlation, in the case of 
normalized observations) matrix < x l ■ x > is estimated from the dataset. The 
eigen-values A (diagonal matrix) and the eigen-vector matrix V of < x £ • x > are then 
computed using a classical numerical routine. The number of PCA or ICA extracted 
components Q is chosen by observing the spectrum of eigen-values. 


• PCA solution: 
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- The d x Q PC A base function matrix G P c.\ contains in its columns the first 
Q eigen-vectors of V (the columns of V represent time series in the time 
decomposition, and geographical field in the space decomposition, . . . ). 

- Since, by definition, V~ l = V 4 , the filter PCA matrix Jpca is equal to the 
transposed Q x d base function matrix Gpca ■ Then, the extracted components h 
that estimate the true components cr are the projection of the observation x onto 
the filters: h = J pc a * *• 

- The first Q eigen-values in A represents the variability explained by each of the Q 
components. 

• ICA solution: 

1 Pre-whitening of dataset: The PCA solution is used for a pre-processing data step: 
the observations x e are projected into the PCA filters: x e <— J pca • x * • The ICA 
algorithm is then applied into these Q-dimensional data. 

2 The ICA solution J IC a is initialized as the identity matrix I QxQ . This, associated 
to the previous whitening step of data, is equivalent to taking the PCA solution 
as first guess for ICA. 

3 For the minimization of the criterion specifying the statistical independence, a 
stochastical gradient descent algorithm is used. The stochastical principle uses 
the gradient descent formula iteratively in unique random samples of the dataset, 
contrary to the classical approach where the gradient descent formula is applied 
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iteratively to the global dataset (i.e. a deterministic algorithm). The stochastical 
character of an optimization algorithm allows theoretically, under some constraint 
not discussed here, for the optimization technique to reach the global minimum of 

the criterion instead of the local minimum where a deterministic algorithm could 
be trapped. 

4 An observation x e is randomly chosen in the dataset. The propagation through 
the neui al network (chosen model for the extraction component) is given by: 
y = f(h) = /(Jica ■ x e ), Where /(a) = 1/1 + exp{-/3 ■ a) is the logistic function 
(/? is a parameter controlling the slope of the logistic function, we take /? = 2.0). 
The FORTRAN routine of this process is: 
c - - propagation into the neural network 

do i = 1, d 

h{i) = O.dO 
do k = 1, d 

h(i) = h(i) + JicA{h k) * x e (k) 
enddo 

h(i) = h(i) + bia(i) 

: y(i ) = I.cf0/(l.d0 + dexp(~P * h(i ))) 
enddo 

where bia is the classical bias vector in a MLP neural, network (not shown in the 
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text for simplicity). We use, in this routine, double precision variables to avoid 
numerical instabilities. 

5 The learning process is then defined as: 
c - - transitory quantities 
do j = 1 ,d 

hhh(j) = O.dO 

do k = l,d 

hhh{j) = hhh(j) + JicA{k,j) * h(k ) 
enddo 
enddo 

c - - modification of weights 
do i = 1, d 

do j = 1, d 

JicAihj ) = JicA{iJ)+param* 
k (JiCAihj) + P* (l.dO - 2. dO * y{i )) * hhh{j)) 
enddo 

bia(i ) = bia(i) + (3 * (l.dO - 2.d0 * y(i)) 


enddo 
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where par am is the learning param. ter of the gradient descent optimization (we 
take param = 0.0005). 

6 Stopping criterion: many criteria can be used to define when stopping the previous 
learning steps. The simplest criterio.i is to determine a priori the number of 
learning steps. A better adequate cr terion is the measure of the difference 
between solution J ica at time t and at time t + 1: if this difference is low, 
the algorithm has converged. Ar othei stopping criterion is the measure of the 
statistical independence of the e>tract( I components h: cumulants (i.e. additive 
higher-order moments) are a practical 1 - ay to do that, but this approach is still 
computationally expensive. The learning algorithm returns to step 4 until the 
stopping criterion is reached. 

• Analysis of results: When the matrix J\ C a has been determined by ICA, the global 
ICA filters (taking into account the PCA pre-processing) are defined by the Q x d 
matrix: J glo = J ica • J pca 

- The projection of data is used to estimate .he components: h = Jglo • x e 

- The d x Q ICA base function matrix Gql y = J glo~ 1 = Gpca • J ica ~ 1 is 
normalized to obtain normalized ICA base functions, as in PCA approach. 

- Computation of explained variance of each of the base functions. 
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Figure 1 . Problems encountered by PC A when observations have dimension 2 (coor- 
dinates X and Y) and come from two components defining ellipses El and E2, the line 
D represents the first PCA axe defining the first PCA component: A) mixing due to 
the maximum explained constraint, B) indeterminacy when two components have same 

variance, and C) mixing due to the non-orthogonality of components. 

Figure 2. T re component extraction model: the perceptron architecture, where x is 

the observation, h is the extracted component vector and h is the ouput network. 
Figure 3. Temporal base functions g^: ACTUAL (solid lines), PCA estimates (crossed 

lines) and ICA estimates (points). 

Figure 4. The components score maps of actual components Oi (left column), of the 

PCA extracted components h t (middle column), and of the ICA extracted components 

hi (right column): components number 1-6 from the top to the bottom, colors from the 

blue to the red (rainbow) representing range 0-1 for actual amplitudes (left colum) and 

-3 to +3 for extracted components amplitudes (middle and right column) 

Figure 5. Cumulative percent of explained variance by the PCA components 



Table 1 . Variance explained by 


Noise, REAL, PC A and ICA com- 
ponents 


Component 

REAL 

PCA 

ICA 

1 

13.3 

24.4 

12.7 

2 

12.6 

14.5 

13.0 

3 

10.7 

10.7 

11.3 

4 

10.7 

8.8 

10.2 

5 

10.7 

6.2 

10.3 

6 

10.0 

3.1 

10.0 

Noise 

33.0 

32.3 

32.3 
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